AmyloGram: Analysis of proteins in R

Jarek Chilimoniuk

Department of Bioinformatics and Genomics, University of Wroclaw

Proteins

What are proteins and what do they do?


  • Antibody
  • Enzyme
  • Messenger
  • Structural component
  • Transport/storage

Amino acids

Proteins

A protein's higher-order structure determines its function.

Human proteome

1937 human proteins have an unknown role (the dark proteome) (Paik et al., 2018).

Goal

To develop methods that predict protein properties from primary structure in a way that is understandable to biologists and experimentally validated.

n-grams and reduced alphabets

n-grams (k-tuples, k-mers):

  • subsequences (continuous or discontinuous) of n amino acid or nucleotide residues,
  • more informative than individual residues.


Sequences are encoded as n-grams so that they can serve as features for machine learning.

Peptide I: FKVWPDHGSG

Peptide II: YMCIYRAQTN

n-gram examples from peptide I and II:

  • 1-gram: F, Y, K, M,
  • 2-gram: FK, YM, KV, MC,
  • 2-gram (discontinuous): F-V, Y-C, K-W, M-I,
  • 3-gram (discontinuous): F--WP, Y--IY, K--PD, M--YR.

Longer n-grams are more informative, but create larger attribute spaces that are more difficult to analyze.
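
The n-grams listed above can be reproduced with a few lines of code. The sketch below is written in Python and is not the biogram API. The function name `extract_ngrams` and its `gaps` argument are hypothetical, introduced only for illustration: `gaps[i]` is the number of residues skipped between element i and element i+1 of the n-gram. The final loop prints the number of possible continuous n-grams over 20 amino acids, which shows how quickly the attribute space grows.

```python
def extract_ngrams(seq, n, gaps=None):
    """Return all n-grams of `seq`, in order of their starting position.

    gaps: hypothetical parameter, a list of n - 1 integers; gaps[i] is the
    number of residues skipped between elements i and i + 1 of the n-gram
    (all zeros gives continuous n-grams). Skipped residues are written as '-'.
    """
    if gaps is None:
        gaps = [0] * (n - 1)
    if len(gaps) != n - 1:
        raise ValueError("gaps must contain n - 1 values")
    # Offsets of the residues that are kept, relative to the n-gram start.
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g + 1)
    span = offsets[-1] + 1  # window length, including skipped residues
    ngrams = []
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        ngrams.append("".join(c if i in offsets else "-" for i, c in enumerate(window)))
    return ngrams


peptide_1 = "FKVWPDHGSG"

print(extract_ngrams(peptide_1, 2)[:2])          # ['FK', 'KV']
print(extract_ngrams(peptide_1, 2, [1])[:2])     # ['F-V', 'K-W']
print(extract_ngrams(peptide_1, 3, [2, 0])[:2])  # ['F--WP', 'K--PD']

# Number of possible continuous n-grams over the 20 standard amino acids.
for n in (1, 2, 3):
    print(n, 20 ** n)
```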

slam: Sparse Lightweight Arrays and Matrices

Counting n-grams produces large sparse matrices, which are costly to store in dense form.

Number of sparse matrices   Package   File size [MB]
1                           base      0.000214
1                           slam      0.001122
10                          base      0.000969
10                          slam      0.001312
100                         base      0.0765
100                         slam      0.002625
1000                        base      7.629601
1000                        slam      0.016357
10000                       base      762.939659
10000                       slam      0.153687
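
Why a sparse representation helps can be seen by counting how many cells of an n-gram count matrix are actually non-zero. The following Python sketch uses plain dictionaries and is not the slam API. It stores one `Counter` per sequence, holding only non-zero counts, and compares that with the number of cells a dense matrix over all 400 possible 2-grams would need. The two sequences are the example peptides used earlier in the deck.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def count_2grams(seq):
    """Count continuous 2-grams; only n-grams that occur are stored."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


sequences = ["FKVWPDHGSG", "YMCIYRAQTN"]

# Sparse form: one Counter per sequence, holding non-zero counts only.
sparse_rows = [count_2grams(s) for s in sequences]

# Dense form: one column for each of the 20 * 20 possible 2-grams.
columns = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
dense_rows = [[row[c] for c in columns] for row in sparse_rows]

stored_sparse = sum(len(row) for row in sparse_rows)
stored_dense = sum(len(row) for row in dense_rows)
print(stored_sparse, "non-zero cells out of", stored_dense)
```

For these two peptides only 18 of 800 cells are non-zero, and the gap widens for longer n-grams because the number of columns grows as 20^n.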

QuiPT

The Quick Permutation Test (QuiPT) is a fast alternative to permutation tests for n-gram data. It also allows precise estimation of p-values.

QuiPT is available as part of the biogram R package.
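
For context, the sketch below shows an ordinary permutation test of the kind QuiPT is described as replacing. It scores the association between the presence of a single n-gram and the class label, then estimates a p-value by shuffling the labels. This is Python, not the biogram implementation. The choice of information gain as the test statistic, the function names, and the toy data are all assumptions made for illustration.

```python
import math
import random


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for value in (0, 1):
        p = labels.count(value) / n
        if p > 0:
            h -= p * math.log2(p)
    return h


def information_gain(feature, target):
    """Reduction in target entropy after splitting on a binary feature."""
    n = len(target)
    gain = entropy(target)
    for value in (0, 1):
        subset = [t for f, t in zip(feature, target) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def permutation_p_value(feature, target, n_perm=1000, seed=0):
    """Estimate P(gain >= observed) by shuffling the target labels."""
    rng = random.Random(seed)
    observed = information_gain(feature, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if information_gain(feature, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (made up): the n-gram is present in every positive sequence
# and absent from every negative one.
target = [1] * 10 + [0] * 10
feature = list(target)
print(permutation_p_value(feature, target))
```

The smallest p-value such a test can report is 1 / (n_perm + 1), so resolving very small p-values requires a very large number of permutations for every n-gram tested.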

Reduced alphabets

  • amino acids are grouped into larger sets on the basis of specific criteria,
  • easier prediction of structures (Murphy, Wallqvist, and Levy 2000),
  • creation of more generalised models,
  • feature engineering.

The following peptides appear to be completely different in terms of amino acid composition.

Peptide I: FKVWPDHGSG

Peptide II: YMCIYRAQTN

Pattern searching

Group   Amino acids
1       C, I, L, K, M, F, P, W, Y, V
2       A, D, E, G, H, N, Q, R, S, T

Peptide I:    FKVWPDHGSG   -->   1111122222

Peptide II:   YMCIYRAQTN   -->   1111122222
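
The encoding step can be sketched as a simple lookup. The Python snippet below is illustrative only and is not taken from any of the R packages mentioned in this deck. It uses the two-group alphabet from the table above, and the names `GROUPS`, `AA_TO_GROUP`, and `encode` are hypothetical.

```python
# Two-group alphabet from the table above.
GROUPS = {
    "1": "CILKMFPWYV",
    "2": "ADEGHNQRST",
}

# Invert the mapping: amino acid -> group label.
AA_TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def encode(seq, mapping=AA_TO_GROUP):
    """Replace every residue with the label of its group."""
    return "".join(mapping[aa] for aa in seq)


print(encode("FKVWPDHGSG"))  # 1111122222
print(encode("YMCIYRAQTN"))  # 1111122222
```

After encoding, the two peptides are identical, so any n-gram counted on the reduced sequences is shared by both.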

Amyloid prediction

Amyloids

Amyloid aggregates are found in the tissues of people suffering from neurodegenerative disorders such as Alzheimer’s disease and Parkinson’s disease, as well as in many other diseases.

Amyloid aggregates (red) around neurons (green). Strittmatter Laboratory, Yale University.

Amyloids

Source: National Institute on Aging (NIA) | National Institutes of Health (NIH)

Amyloid proteins

Short peptide sequences with amyloidogenic properties (hot spots) are responsible for the aggregation of amyloidogenic proteins. They are:

  • short (6-15 amino acids),
  • very variable in amino acid composition, but usually hydrophobic,
  • able to form unique \(\beta\)-structures.

(Sawaya et al. 2007)

AmyloGram: n-gram-based amyloid prediction tool




Burdukiewicz, M., Sobczyk, P., Rödiger, S., Duda-Madej, A., Mackiewicz, P., and Kotulska, M. (2017). Amyloidogenic motifs revealed by n-gram analysis. Scientific Reports 7, 12961

ranger: A Fast Implementation of Random Forests

Package                  Runtime [h]                               Memory usage [GB]
                         mtry=5,000   mtry=15,000   mtry=135,000
randomForest             101.24       116.15        248.60         39.05
randomForest (MC)        32.10        53.84         110.85         105.77
bigrf                    NA           NA            NA             NA
randomForestSRC          1.27         3.16          14.55          46.82
Random Jungle            1.51         3.60          12.83          0.40
Rborist                  NA           NA            NA             >128
ranger                   0.56         1.05          4.58           11.26
ranger (save.memory)     0.93         2.39          11.15          0.24
ranger (GWAS mode)       0.23         0.51          2.32           0.23

Marvin N. Wright and Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77(1)

Cross-validation

Standard reduced alphabets

Do standard reduced alphabets developed for different biological issues help to improve amyloid prediction?

Standard reduced amino acid alphabets do not improve the quality of amyloid prediction.


Novel reduced amino acid alphabets

17 measures handpicked from the AAIndex database, covering:

  • size of residues,

  • hydrophobicity,

  • solvent surface area,

  • frequency in \(\beta\)-sheets,

  • contactivity.

524,284 reduced amino acid alphabets were generated, with different levels of reduction (three to six amino acid groups).
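
The total is consistent with one simple count, offered here as an assumption rather than something stated on the slide: every non-empty subset of the 17 measures combined with each of the four alphabet sizes (three, four, five, or six groups). The Python arithmetic below only checks that this reading gives the same number.

```python
n_measures = 17
group_counts = [3, 4, 5, 6]

# Assumption: one alphabet per (non-empty subset of measures, number of groups) pair.
non_empty_subsets = 2 ** n_measures - 1
total_alphabets = non_empty_subsets * len(group_counts)
print(total_alphabets)  # 524284
```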


Selection of best-performing reduced alphabet

Within each category, the alphabets were ranked by AUC (rank 1 for the best AUC, and so on).

The best alphabet was the one with the lowest sum of ranks across categories.
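
A minimal Python sketch of rank-sum selection follows. The alphabet names and AUC values are made up purely for illustration, and the function `rank_sums` is hypothetical; it is not the code used in the study.

```python
# Hypothetical AUC values: alphabet -> one AUC per category (made-up numbers).
auc = {
    "alphabet_A": [0.80, 0.85, 0.90],
    "alphabet_B": [0.82, 0.84, 0.88],
    "alphabet_C": [0.78, 0.86, 0.89],
}


def rank_sums(auc_by_alphabet):
    """Rank alphabets within each category (1 = best AUC) and sum the ranks."""
    names = list(auc_by_alphabet)
    n_categories = len(next(iter(auc_by_alphabet.values())))
    totals = {name: 0 for name in names}
    for k in range(n_categories):
        # Sort by AUC in category k, highest first; ties keep input order.
        ordered = sorted(names, key=lambda name: auc_by_alphabet[name][k], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals


totals = rank_sums(auc)
best = min(totals, key=totals.get)
print(totals)  # {'alphabet_A': 5, 'alphabet_B': 7, 'alphabet_C': 6}
print(best)    # alphabet_A
```

In this toy example alphabet_A wins only one category, yet it has the lowest rank sum because it is never ranked last. Rank sums reward consistent performance across categories over a single top score.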

Best-performing reduced alphabet


Group   Amino acids
1       G
2       K, P, R
3       I, L, V
4       F, W, Y
5       A, C, H, M
6       D, E, N, Q, S, T

Groups 3 and 4: hydrophobic amino acids.

Group 2: amino acids that disrupt the \(\beta\)-structure (\(\beta\)-breakers).


Alphabet similarity and quality of prediction

Is the best-performing reduced amino acid alphabet associated with amyloidogenicity?

Similarity index

The similarity index (Stephenson and Freeland 2013) measures the similarity between two reduced alphabets (1: identical alphabets; 0: completely dissimilar alphabets).

The correlation between an alphabet's similarity index to the best-performing alphabet and its average AUC is statistically significant (\(\textrm{p-value} \leq 2.2 \times 10^{-16}\); \(\rho = 0.51\)).

Are informative n-grams found by QuiPT associated with amyloidogenicity?

Informative n-grams

Of the 65 most informative n-grams, 15 (23%) are also present in amino acid motifs found experimentally (López de la Paz and Serrano 2004).

Benchmark results


Program                                                     AUC      MCC
AmyloGram                                                   0.8972   0.6307
PASTA 2.0 (Walsh et al. 2014)                               0.8550   0.4291
FoldAmyloid (Garbuzynskiy, Lobanov, and Galzitskaya 2010)   0.7351   0.4526
APPNN (Família et al. 2015)                                 0.8343   0.5823


AmyloGram, the classifier trained using the best reduced alphabet, was compared with other amyloid prediction tools on an external dataset.

MCC (Matthews correlation coefficient) measures the performance of a classifier: 1 means every prediction is correct, 0 means predictions are no better than random, and -1 means every prediction is wrong.
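
For reference, MCC is computed from the four cells of the confusion matrix (true positives, true negatives, false positives, false negatives). The Python sketch below uses the standard formula. The counts are made up for illustration and are not taken from the benchmark above.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return numerator / denominator if denominator else 0.0


# Made-up counts, for illustration only.
print(mcc(tp=50, tn=50, fp=0, fn=0))   # 1.0, all predictions correct
print(mcc(tp=0, tn=0, fp=50, fn=50))   # -1.0, all predictions wrong
print(round(mcc(tp=40, tn=30, fp=20, fn=10), 3))  # 0.408
```

Unlike accuracy, MCC uses all four cells, so it stays informative when amyloid and non-amyloid sequences are present in unequal numbers.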


New amyloid

A new functional amyloid produced by Methanospirillum sp. (Christensen et al. 2018) was selected for analysis by AmyloGram.

Shiny application

AmyloGram web server

Summary

Models that predict protein properties can be based on precise rules that are understandable to biologists and experimentally verifiable, without losing their effectiveness.

Acknowledgements

  • Michał Burdukiewicz (Warsaw University of Technology).

  • Małgorzata Kotulska (Wrocław University of Science and Technology).

  • Stefan Rödiger (Brandenburg University of Technology Cottbus-Senftenberg).

  • Paweł Mackiewicz (University of Wrocław).

  • Piotr Sobczyk (Wrocław University of Science and Technology).

Funding:

  • Polish National Science Centre (2015/17/N/NZ2/01845 & 2017/24/T/NZ2/00003).

  • COST ACTION CA15110 (Harmonising standardisation strategies to increase efficiency and competitiveness of European life-science research).

  • KNOW Wrocław Center for Biotechnology.

  • German Federal Ministry of Education and Research (InnoProfile-Transfer-Projekt 03IPT611X).

Web servers


Web servers:

R packages:

References

Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2016. “Prediction of Amyloidogenicity Based on the N-Gram Analysis.” e2390v1. PeerJ Preprints. https://peerj.com/preprints/2390.

Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2017. “Amyloidogenic Motifs Revealed by N-Gram Analysis.” Scientific Reports 7 (1): 12961. doi:10.1038/s41598-017-13210-9.

Christensen, Line Friis Bakmann, Lonnie Maria Hansen, Kai Finster, Gunna Christiansen, Per Halkjær Nielsen, Daniel Erik Otzen, and Morten Simonsen Dueholm. 2018. “The Sheaths of Methanospirillum Are Made of a New Type of Amyloid Protein.” Frontiers in Microbiology 9: 2729. doi:10.3389/fmicb.2018.02729.

Família, Carlos, Sarah R. Dennison, Alexandre Quintas, and David A. Phoenix. 2015. “Prediction of Peptide and Protein Propensity for Amyloid Formation.” PLOS ONE 10 (8): e0134679. doi:10.1371/journal.pone.0134679.

Garbuzynskiy, Sergiy O., Michail Yu Lobanov, and Oxana V. Galzitskaya. 2010. “FoldAmyloid: A Method of Prediction of Amyloidogenic Regions from Protein Sequence.” Bioinformatics (Oxford, England) 26 (3): 326–32. doi:10.1093/bioinformatics/btp691.

Murphy, Lynne Reed, Anders Wallqvist, and Ronald M. Levy. 2000. “Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding.” Protein Engineering 13 (3): 149–52. doi:10.1093/protein/13.3.149.

López de la Paz, Manuela, and Luis Serrano. 2004. “Sequence Determinants of Amyloid Fibril Formation.” Proceedings of the National Academy of Sciences 101 (1): 87–92. doi:10.1073/pnas.2634884100.

Sawaya, Michael R., Shilpa Sambashivan, Rebecca Nelson, Magdalena I. Ivanova, Stuart A. Sievers, Marcin I. Apostol, Michael J. Thompson, et al. 2007. “Atomic Structures of Amyloid Cross-β Spines Reveal Varied Steric Zippers.” Nature 447 (7143): 453–57. doi:10.1038/nature05695.

Stephenson, James D., and Stephen J. Freeland. 2013. “Unearthing the Root of Amino Acid Similarity.” Journal of Molecular Evolution 77 (4): 159–69. doi:10.1007/s00239-013-9565-0.

Walsh, Ian, Flavio Seno, Silvio C. E. Tosatto, and Antonio Trovato. 2014. “PASTA 2.0: An Improved Server for Protein Aggregation Prediction.” Nucleic Acids Research 42 (W1): W301–W307. doi:10.1093/nar/gku399.